A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization
Abstract
The Rocchio relevance feedback algorithm is one of the most popular and widely applied learning methods from information retrieval. Here, a probabilistic analysis of this algorithm is presented in a text categorization framework. The analysis gives theoretical insight into the heuristics used in the Rocchio algorithm, particularly the word weighting scheme and the similarity metric. It also suggests improvements which lead to a probabilistic variant of the Rocchio classifier. The Rocchio classifier, its probabilistic variant, and a naive Bayes classifier are compared on six text categorization tasks. The results show that the probabilistic algorithms are preferable to the heuristic Rocchio classifier not only because they are more well-founded, but also because they achieve better performance.
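As a rough illustration of the heuristic classifier the abstract refers to, the following is a minimal sketch of a Rocchio-style prototype classifier with TF-IDF word weights and cosine similarity. It is not the paper's implementation: the function names, the `alpha` and `beta` defaults, the plain `log(N / df)` form of IDF, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def weight(doc, idf):
    """Raw TF-IDF weights for one tokenized document (unknown terms are skipped)."""
    tf = Counter(doc)
    return {t: tf[t] * idf[t] for t in tf if t in idf}

def normalize(vec):
    """Scale a sparse vector to unit Euclidean length (empty if the norm is zero)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else {}

def train_rocchio(docs, labels, alpha=1.0, beta=0.25):
    """Build one prototype vector per class.

    alpha and beta are illustrative values (an assumption), weighting the
    mean of in-class documents against the mean of out-of-class documents.
    """
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}  # assumed IDF variant
    vectors = [normalize(weight(doc, idf)) for doc in docs]

    prototypes = {}
    for c in sorted(set(labels)):
        pos = [v for v, y in zip(vectors, labels) if y == c]
        neg = [v for v, y in zip(vectors, labels) if y != c]
        proto = defaultdict(float)
        for v in pos:
            for t, w in v.items():
                proto[t] += alpha * w / len(pos)
        for v in neg:
            for t, w in v.items():
                proto[t] -= beta * w / len(neg)
        # Negative components are dropped, a common convention in this setting.
        prototypes[c] = normalize({t: w for t, w in proto.items() if w > 0})
    return prototypes, idf

def classify(doc, prototypes, idf):
    """Assign the class whose prototype has the highest cosine similarity."""
    q = normalize(weight(doc, idf))
    def cosine(a, b):
        # Both vectors are unit length, so the dot product is the cosine.
        return sum(w * b.get(t, 0.0) for t, w in a.items())
    return max(prototypes, key=lambda c: cosine(q, prototypes[c]))

# Toy data, invented for this example.
train_docs = [
    ["cheap", "pills", "buy"],
    ["buy", "cheap", "watches"],
    ["meeting", "agenda", "monday"],
    ["project", "meeting", "notes"],
]
train_labels = ["spam", "spam", "ham", "ham"]
prototypes, idf = train_rocchio(train_docs, train_labels)
print(classify(["cheap", "watches", "buy"], prototypes, idf))  # spam
```

The word weighting (`weight`) and the similarity metric (`cosine`) are kept as separate functions because these are the two heuristics the abstract singles out for analysis; either could be swapped out without touching the prototype construction.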
Similar resources
Exploration on Approaches To Email (text) Classification CS 350 Project
Basic theory about text categorization and information retrieval is presented, and several important algorithms for text classification are described in detail, such as the Rocchio algorithm, TFIDF classifiers and the Naïve Bayes algorithm. An implementation based on the Rocchio algorithm is also discussed and evaluated. It shows that this method is reasonably efficient given fairly small training d...
An Intelligent System for Arabic Text Categorization
Text Categorization (classification) is the process of classifying documents into a predefined set of categories based on their content. In this paper, an intelligent Arabic text categorization system is presented. Machine learning algorithms are used in this system. Many algorithms for stemming and feature selection are tried. Moreover, the document is represented using several term weighting ...
Learning User Profiles from Text in e-Commerce
Exploring digital collections to find information relevant to a user’s interests is a challenging task. Algorithms designed to solve this relevant information problem base their relevance computations on user profiles in which representations of the users’ interests are maintained. This paper presents a new method, based on the classical Rocchio algorithm for text categorization, able to discov...
A Relevance Feedback Method for Discovering User Profiles from Text
The huge amount of data on the Internet often makes it difficult for users to find relevant information. Systems that support users in this task could therefore be a valuable help. Unfortunately, capturing user interests and representing them in a structured form is in general a problematic activity. Our research deals with the application of supervis...
Design and Evaluation of Approaches to Automatic Chinese Text Categorization
In this paper, we propose and evaluate approaches to categorizing Chinese texts, which consist of term extraction, term selection, term clustering and text classification. We propose a scalable approach which uses frequency counts to identify left and right boundaries of possibly significant terms. We used the combination of term selection and term clustering to reduce the dimension of the vect...